PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers
Authors
Abstract
This paper describes the Parallel Universal Matrix Multiplication Algorithms (PUMMA) on distributed memory concurrent computers. The PUMMA package includes not only the non-transposed matrix multiplication routine C = A · B, but also the transposed multiplication routines C = A^T · B, C = A · B^T, and C = A^T · B^T, for a block scattered data distribution. The routines perform efficiently for a wide range of processor configurations and block sizes. Together, the PUMMA routines provide the same functionality as the Level 3 BLAS routine xGEMM. Details of the parallel implementation of the routines are given, and results are presented for runs on the Intel Touchstone Delta computer.
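As a rough illustration of the two ideas named in the abstract, the following sketch shows a serial reference for the four products C = op(A) · op(B) and a simple owner mapping for a block scattered layout. It is not the PUMMA implementation: the function names `matmul` and `block_owner`, the parameter names, and the rule that block (I, J) is assigned to processor (I mod P, J mod Q) are assumptions made for this example.

```python
def block_owner(i, j, block_size, p, q):
    """Return the (row, col) coordinates of the processor that owns element (i, j).

    Assumes a block scattered layout on a p x q processor grid in which
    block (I, J) is assigned to processor (I mod p, J mod q).
    """
    block_row = i // block_size
    block_col = j // block_size
    return (block_row % p, block_col % q)


def matmul(a, b, trans_a=False, trans_b=False):
    """Serial reference for C = op(A) . op(B), where op(X) is X or X^T."""
    if trans_a:
        a = [list(col) for col in zip(*a)]  # transpose A
    if trans_b:
        b = [list(col) for col in zip(*b)]  # transpose B
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    # Each entry of C is the dot product of a row of op(A) and a column of op(B).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]
```

A serial reference of this kind is mainly useful for checking the result of a distributed routine on small inputs, while the owner mapping shows how a block scattered layout spreads neighbouring blocks over different processors.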
Similar resources
Parallel Matrix Transpose Algorithms on Distributed Memory Concurrent Computers
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD...
A New Direction to Parallelize Winograd's Algorithm on Distributed Memory Computers
Winograd’s algorithm to multiply two n × n matrices reduces the asymptotic operation count from O(n^3) of the traditional algorithm to O(n^2.81); thus, on distributed memory computers, the association of Winograd’s algorithm and the parallel matrix multiplication algorithms always gives remarkable results. Within this association, the application of Winograd’s algorithm at the inter-processor leve...
The Spectral Decomposition of Nonsymmetric Matrices on Distributed Memory Parallel Computers
The implementation and performance of a class of divide-and-conquer algorithms for computing the spectral decomposition of nonsymmetric matrices on distributed memory parallel computers are studied in this paper. After presenting a general framework, we focus on a spectral divide-and-conquer (SDC) algorithm with Newton iteration. Although the algorithm requires several times as many floating poin...
A Fast Scalable Universal Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers
We present a fast and scalable matrix multiplication algorithm on distributed memory concurrent computers, whose performance is independent of data distribution on processors, and call it DIMMA (Distribution-Independent Matrix Multiplication Algorithm). The algorithm is based on two new ideas; it uses a modified pipelined communication scheme to overlap computation and communication effectivel...
Comparison of Scalable Parallel Matrix Multiplication Libraries
This paper compares two general library routines for performing parallel distributed matrix multiplication. The PUMMA algorithm utilizes block scattered data layout, whereas BiMMeR utilizes virtual 2-D torus wrap. The algorithmic differences resulting from these different layouts are discussed as well as the general issues associated with different data layouts for library routines. Results on the...
Journal title:
- Concurrency - Practice and Experience
Volume 6, Issue
Pages -
Publication date 1994